Introduction to Open Data Science - Course Project

About the project

Initial feelings, expectations

I am looking forward to the course for two reasons:

  • I would like to get a structured introduction to how R is used for data science, as I have previously learnt it on my own and with much help from a friend
  • hopefully I will also learn some new features

Another initial impression is that the Moodle page is rather verbose and chaotic, which hampers finding the necessary information and the assigned tasks. I hope that I will get used to it and will not miss any compulsory assignment.

I found the course in Sisu because I still needed one credit for my Transferable Skills section.

Learning experience with R for Health Data Science

## [1] "I've found the book interesting and well written. Naturally due to my previous experience in R, I was only skim-reading it. Nevertheless I have found some useful functionalities that I haven't used before. e.g."
  1. the usage of logical operators inside the filter function, e.g. filter(year == 2020 | disease == "COVID19")
  2. I tend to forget how to dodge bar columns, but I was reminded and produced some nice plots:
gapdata2007 %>% 
  ggplot(aes(x = continent, y = lifeExp, fill = country)) +
  geom_col(position="dodge") + theme(legend.position = "none") 

  3. I was not aware of the percent() function (from the scales package in the tidyverse ecosystem); it’s much simpler than sprintf("%2.1f%%", ...).
  4. It was nice to learn that with plotly and the HTML output of R Markdown one can create interactive HTML pages (one can hover over the diagram below to highlight countries).
plot = gapdata2007 %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, label = country)) +
  geom_point()
  
ggplotly(plot)
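Returning to the percent() point above, the two formatting approaches produce the same string. A minimal sketch (note that percent() actually comes from the scales package, which the tidyverse installs):

```r
library(scales)

x <- 0.1234
percent(x, accuracy = 0.1)    # "12.3%"
sprintf("%2.1f%%", 100 * x)   # "12.3%"
```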

Regression analysis

Task description

Describe the work you have done this week and summarize your learning.

  • Describe your work and results clearly.
  • Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
  • Assume the reader has no previous knowledge of your data or the more advanced methods you are using.

Analysis

Data characteristics

We have data on questionnaire responses from individuals together with their results on a test (exam). The questionnaire seems to concern study habits and attitudes. The origin of the data cannot be clearly identified from the metadata provided.

  df = as_tibble(read.csv('http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt', sep = '\t'))
  
  learning2014 = read_csv('data/learning2014.csv') %>% 
    rename(points = Points, attitude = Attitude)

The original raw data consists of 183 records and 60 variables. After summarizing groups of related questions by their means (for instance, questions about in-depth studying into a "deep" score) and removing observations with 0 exam points, the processed data consists of 166 observations and 7 variables. Information on the data structure can be found below.

  str(learning2014)
## spec_tbl_df [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ Age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   Age = col_double(),
##   ..   Attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   Points = col_double()
##   .. )

Graphical overview

Let’s have a look at the distributions of the variables in this data by plotting them pairwise in the figure below. We suspect that there are differences between male and female participants, hence we stratify the data into these two groups by color.

  colMap = c("F" = "#FF5555", "M" = "#1100EE")
  # create a more advanced plot matrix with ggpairs()
  ggpairs(learning2014, mapping = aes(color = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
           upper = list(continuous = wrap("cor")) ) +
    scale_color_manual(values = colMap) +
    scale_fill_manual(values = colMap)

We can read several pieces of information from the plot. First, there were almost 1.5 times more females (red) in the study than males (blue), as can be seen from the gender bar chart. Based on the boxplots in the top row, the males' attitude towards statistics was higher; however, there was no huge difference in points between males and females. Nevertheless, the positive correlation between attitude and points is observable for both groups.

Regression model

In a linear regression analysis we search for a linear relationship between predictor (feature) variables and a target outcome. Let’s pick three predictors that are the most likely candidates based on the previous plot:

my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The results show that attitude is a significant predictor of exam points, as the p-value for its slope is very low (\(1.93 \times 10^{-8}\)). The F-statistic of the test also indicates that at least one of the selected variables has a non-zero slope.

The t-test for each variable has the null hypothesis that the slope of that variable in the regression equation is 0; if that is true, there is no association between the variable and the target. If the p-value is low, the observed result is unlikely under the null hypothesis, so we usually reject it, i.e. we believe there is an association.
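As a sketch of what "unlikely under the null hypothesis" means, we can fit a regression to simulated data where the true slope really is 0 (a hypothetical example, not the course data):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + rnorm(100)   # y does not depend on x: the true slope is 0
fit <- lm(y ~ x)

# p-value for the slope; with no true effect this is usually well above 0.05,
# so we would (correctly) fail to reject the null hypothesis
summary(fit)$coefficients["x", "Pr(>|t|)"]
```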

So we fit the model again with attitude only (as that was the only variable showing significance).

Interpret regression parameters

my_model <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The summary of this model shows a positive association (with a slope of about 3.5) between the attitude score and exam points. In lay terms this means that a one unit increase in attitude corresponds to roughly a 3.5 point increase in exam results, on top of a baseline (intercept) of about 11.6 for those with the lowest attitudes. The multiple R-squared value gives the proportion of variation that is explained by the variables in the model. In this case it is 0.1906, so a little less than 20% of the variation is explained by the explanatory variable, meaning that there are substantial other factors affecting the exam results that we are omitting here.
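To make the interpretation concrete, plugging the reported (rounded) coefficients into the regression line shows the ~3.5 point jump per attitude unit:

```r
# Coefficients reported in the summary above (rounded)
intercept <- 11.6372
slope     <- 3.5255

# Predicted exam points at attitudes 2 and 3
intercept + slope * 2                              # 18.69
intercept + slope * 3                              # 22.21
(intercept + slope * 3) - (intercept + slope * 2)  # 3.5255, the slope itself
```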

Graphical model interpretation

par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

The linear regression model assumes that the observations are produced by a linear function of the explanatory variable plus additive, normally distributed noise with zero mean. The residuals are estimates of this noise, hence it is good to check how close to normal they are, and whether they are independent of the fitted values (otherwise the noise would depend on the observation). The first two plots refer to this: in the first we can see that the residuals are evenly distributed over the fitted values, and the Q-Q plot shows that the quantiles of the residual distribution are very close to those of a normal distribution. The last plot, "Residuals vs Leverage", indicates which observations may have the greatest effect on the parameters (linear regression is known to be sensitive to outliers), hence it could be useful to check if the model improves a lot by removing those observations.

Prediction

Beyond explaining the data, linear regression models can also be used to predict unobserved outcomes, as below.

new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)

# Print out the new data
new_data
##        attitude
## Mia         3.8
## Mike        4.4
## Riikka      2.2
## Pekka       2.9
# Predict the new students exam points based on attitude
predict(my_model, newdata = new_data)
##      Mia     Mike   Riikka    Pekka 
## 25.03390 27.14918 19.39317 21.86099
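As a sanity check (using the rounded coefficients from the summary above), the predictions are just the fitted line evaluated at the new attitude values:

```r
intercept <- 11.6372
slope     <- 3.5255
new_attitudes <- c(Mia = 3.8, Mike = 4.4, Riikka = 2.2, Pekka = 2.9)

round(intercept + slope * new_attitudes, 2)
#    Mia   Mike Riikka  Pekka
#  25.03  27.15  19.39  21.86   (matches predict() above up to rounding)
```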

Logistic regression

1. Create the file

As one can see, it has been created.

2. Data characteristics

We have data on student performance at school, including grades, the number of previous class failures, and related background information such as family status, alcohol consumption, etc. We had information from two classes: Math and Portuguese. To create a joint dataset we averaged (and rounded) the grades from these two classes and included only students present in both classes (at least based on their attributes; the table does not contain unique student identifiers, which may cause mix-ups).

# Double check the data wrangling result
  df_dc = as_tibble(read.csv('https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv', sep = ','))
  
  df = read_csv('data/student_joint.csv')
  #full_join(df,df_dc)
  # ok, this is of the same size, i.e. the tables agree.

After these operations there are 370 students with 35 variables in the data.

  str(df)
## spec_tbl_df [370 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school    : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex       : chr [1:370] "F" "F" "F" "F" ...
##  $ age       : num [1:370] 15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr [1:370] "R" "R" "R" "R" ...
##  $ famsize   : chr [1:370] "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr [1:370] "T" "T" "T" "T" ...
##  $ Medu      : num [1:370] 1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : num [1:370] 1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr [1:370] "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr [1:370] "other" "other" "other" "health" ...
##  $ reason    : chr [1:370] "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr [1:370] "mother" "mother" "mother" "mother" ...
##  $ traveltime: num [1:370] 2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : num [1:370] 4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ activities: chr [1:370] "yes" "no" "yes" "yes" ...
##  $ nursery   : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet  : chr [1:370] "yes" "yes" "no" "yes" ...
##  $ romantic  : chr [1:370] "no" "yes" "no" "no" ...
##  $ famrel    : num [1:370] 3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : num [1:370] 1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : num [1:370] 2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : num [1:370] 1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : num [1:370] 1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : num [1:370] 1 5 2 5 3 5 5 4 3 4 ...
##  $ failures  : num [1:370] 0 1 0 0 1 0 1 0 0 0 ...
##  $ absences  : num [1:370] 3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : num [1:370] 10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : num [1:370] 12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : num [1:370] 12 8 12 9 12 12 6 10 16 10 ...
##  $ paid      : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ alc_use   : num [1:370] 1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi [1:370] FALSE TRUE FALSE FALSE TRUE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   paid = col_character(),
##   ..   alc_use = col_double(),
##   ..   high_use = col_logical()
##   .. )

3. Hypothesis

I picked four variables to examine whether they are associated with high alcohol consumption. Below I explain my a priori expectations.

  • studytime: likely the less someone studies, the more time is available for consuming alcohol. It is unlikely that someone would use alcohol to mitigate stress from too much studying, so I expect a negative correlation.
  • activities: the more activities someone does, the less time they have for hanging out in bad company; I again expect a negative correlation.
  • higher (meaning higher education plans): students with clearer plans for their future are less likely to be heavy alcohol consumers.
  • freetime: students with too much free time will probably get bored more often, hence consume more alcohol to make their life exciting.

4. Verification

   p1 = df %>% ggplot() + aes(x = high_use, y = studytime, fill = high_use) + geom_violin()
   p2 = df %>% ggplot() + aes(x = activities, fill = high_use) + geom_bar(position = "fill")
   p3 = df %>% ggplot() + aes(x = higher, fill = high_use) + geom_bar(position = "fill")
   p4 = df %>% ggplot() + aes(x = high_use, y = freetime, fill = high_use) + geom_violin()
   grid.arrange(p1,p2,p3,p4)

  • The first violin plot shows that study time affects high alcohol usage in that those who study a lot are less likely to be heavy alcohol users, though there are exceptions. For shorter study hours there seems to be no big difference between the distributions of high and low alcohol users. My hypothesis was not exactly this, but along these lines: little studying does not necessarily cause high alcohol consumption, but many study hours seem to prevent it.
  • Regarding extra-curricular activities, there is a slight increase in the proportion of high alcohol users in the no-hobby group, but this does not look very significant. This is again in line with my hypothesis, but the effect is much smaller.
  • Regarding higher education plans there is a clear association: among those planning further education the proportion of high alcohol users is much smaller than in the other group. This is in line with my hypothesis.
  • Free time is in some sense complementary to study time, and the difference is also reflected here. Similarly, a lot of free time does not necessarily mean high alcohol usage, but very little free time seems to prevent it. As in the first case, this was not exactly my hypothesis, but some sort of inversion of it.

5. Logistic regression

  m = glm(high_use ~ studytime + activities + higher + freetime, data = df, family = "binomial")

  summary(m)
## 
## Call:
## glm(formula = high_use ~ studytime + activities + higher + freetime, 
##     family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6471  -0.8668  -0.6585   1.1902   2.1362  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.0841     0.7071  -0.119 0.905327    
## studytime      -0.5273     0.1577  -3.344 0.000826 ***
## activitiesyes  -0.2283     0.2400  -0.951 0.341541    
## higheryes      -0.7545     0.5398  -1.398 0.162162    
## freetime        0.3340     0.1246   2.680 0.007359 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 423.64  on 365  degrees of freedom
## AIC: 433.64
## 
## Number of Fisher Scoring iterations: 4

The model suggests that the studytime and freetime variables are statistically significant predictors of high/low alcohol usage. It is interesting that the higher education plan is not significant, but on double-checking the data it is severely skewed towards students planning higher education (16 students not planning versus 354 planning). With so few students in the no-plans group, the effect of this variable cannot be estimated reliably. The categorical variables (activities and higher) were internally converted to "dummy" indicator variables; as both have only two levels, only one dummy variable was created for each.
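The dummy coding mentioned above can be seen with model.matrix() on a toy two-level factor (hypothetical data, just for illustration):

```r
f <- factor(c("no", "yes", "yes", "no"))
model.matrix(~ f)
# The single column 'fyes' is 1 for "yes" and 0 for "no";
# with two levels, one indicator column (plus the intercept) is enough.
```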

6. Prediction evaluation

coef(m)
##   (Intercept)     studytime activitiesyes     higheryes      freetime 
##   -0.08410415   -0.52727634   -0.22825310   -0.75452928    0.33399121
exp(coef(m))
##   (Intercept)     studytime activitiesyes     higheryes      freetime 
##     0.9193355     0.5902103     0.7959228     0.4702319     1.3965309

The coefficients describe the association between each variable and the outcome; the sign gives the direction, i.e. a negative coefficient means that a higher value (e.g. more studytime) lowers the predicted probability of high alcohol usage. In exponential form the coefficients are odds ratios for a one unit change, quantifying each variable's effect in the model: values close to 1 mean no effect, while deviations in either direction show that the variable matters for the prediction.
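For example, taking the studytime coefficient reported above:

```r
b_studytime <- -0.52727634   # estimate from the model summary above

# ~0.59: one extra unit of studytime multiplies the odds of high use by ~0.59
exp(b_studytime)
# two extra units: odds multiplied by ~0.35 (= 0.59^2)
exp(2 * b_studytime)
```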

confint(m)
## Waiting for profiling to be done...
##                     2.5 %     97.5 %
## (Intercept)   -1.47694251  1.3160333
## studytime     -0.84575068 -0.2261421
## activitiesyes -0.70031246  0.2420972
## higheryes     -1.84750298  0.3028107
## freetime       0.09312641  0.5827670

These are 95% confidence intervals. If an interval does not contain 0, the result is conventionally considered significant, and indeed this matches the significance column of the model summary. The results match my hypotheses, except for the higher education variable, which is explained by the skew discussed above.
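The profile intervals above are close to the simple Wald approximation, estimate ± 1.96 × SE; e.g. for studytime (using the rounded values from the summary):

```r
est <- -0.5273   # studytime estimate
se  <-  0.1577   # its standard error

# about (-0.84, -0.22), close to the profile interval (-0.846, -0.226) above
est + c(-1.96, 1.96) * se
```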

# Use only the significant variables
  m2 = glm(high_use ~ studytime + freetime, data = df, family = "binomial")
  probabilities <- predict(m2, type = "response")
  
  df %<>% mutate(probability = probabilities)
  df %<>% mutate(prediction = if_else(probability>0.5,TRUE,FALSE))
  
  # tabulate the target variable versus the predictions
  tt = table(high_use = df$high_use, prediction = df$prediction)
  
  tt
##         prediction
## high_use FALSE TRUE
##    FALSE   250    9
##    TRUE    102    9

This table is also called the confusion matrix. A usual metric calculated from it is precision (TP/(TP+FP)), which is 0.5 here; this is pretty bad. Another common metric is recall (TP/(TP+FN)), which is about 0.081, almost 0. Accuracy (correct predictions / total predictions) was a bit better, 0.7, so the training error is 1 - accuracy, i.e. 0.3. This model performs pretty badly at identifying the high alcohol users but pretty well at identifying the non-high consumers. Simple guessing (without any other knowledge) should give an accuracy around 0.5, hence this model is still a little better than a random guess.
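The metrics quoted above can be recomputed directly from the cells of the table:

```r
# Cells of the confusion matrix above
TN <- 250; FP <- 9
FN <- 102; TP <- 9

precision <- TP / (TP + FP)                   # 0.5
recall    <- TP / (TP + FN)                   # ~0.081
accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 259 / 370 = 0.7

c(precision = precision, recall = recall, accuracy = accuracy)
```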


Clustering and classification

1-2. Data characteristics

  # load the data
  data("Boston")
  df = as_tibble(Boston)

In this exercise we are working with a dataset that is readily available in the MASS (Modern Applied Statistics with S) package of R and contains socio-economic features of Boston suburbs. More info on the data can be found at: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html. The raw data consists of 506 records that refer to different suburbs and 14 variables, including for instance the per capita crime rate and the age of the housing stock.

  str(df)
## tibble [506 x 14] (S3: tbl_df/tbl/data.frame)
##  $ crim   : num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num [1:506] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num [1:506] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int [1:506] 0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num [1:506] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num [1:506] 6.58 6.42 7.18 7 7.15 ...
##  $ age    : num [1:506] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num [1:506] 4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int [1:506] 1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num [1:506] 296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num [1:506] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num [1:506] 397 397 393 395 397 ...
##  $ lstat  : num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

3. Graphical overview

Let’s have a look at the distribution of variables in this data by plotting them pairwise on a scatter plot.

  p = ggpairs(df, lower = list(combo = wrap("facethist", bins = 20)),
           upper = list(continuous = wrap("cor", size = 2)) ) +
  theme(text = element_text(size = 8), axis.text.x = element_text(angle = 45))

  p

  # It's a bit hard to see with ggpairs in this case, but it is more informative than the simple corrplot.
  
  cor_matrix <- cor(df) 

  corrplot(cor_matrix, method="circle")

The variables follow various distributions. For instance the crim variable, which encodes the crime rate of the town, seems to follow some sort of power-law distribution, i.e. there are a few suburbs with very high crime rates while the others are rather low. On the other hand the rm variable, the average number of rooms per dwelling, clearly has a normal distribution centred around 6 (which feels a bit high). Again, indus and tax, which encode the "proportion of non-retail business" (i.e. how industrialised a town is) and the "full-value property-tax rate" respectively, seem to have bimodal distributions, suggesting for instance that the industrialised suburbs separate from the residential quarters.

One of the strongest correlations involves this industrial variable: it is negatively correlated with the dis variable, which measures the "weighted mean of distances to five Boston employment centres." This makes sense: the industrialised areas (likely including white-collar businesses) are employment centres, so the less industrialised a district is, the further away it is from them. The nitrogen oxide concentration is strongly negatively correlated with this distance, meaning that areas close to the employment centres have higher NO concentrations, suggesting that the employment centres lie in the more industrialised, more polluted parts of the city.

4. Scaling

As all of these variables are numeric, let's scale them to have mean 0 and standard deviation 1, by

\[\mathrm{scaled}(x) = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}\]
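This is exactly what the base scale() function computes column-wise; a quick check on a toy vector:

```r
x <- c(1, 2, 3, 4)
manual <- (x - mean(x)) / sd(x)

# scale() returns a matrix, so drop the dimensions before comparing
all.equal(as.vector(scale(x)), manual)   # TRUE
```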

df_scaled = as_tibble(scale(df))
summary(df_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

As expected, the means became 0.

bins = quantile(df_scaled$crim)

# create a categorical variable 'crime'
df_scaled %<>% mutate(crim = cut(crim, breaks = bins, include.lowest = TRUE))

table(df_scaled$crim)
## 
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##             127             126             126             127
# The uniform distribution makes sense, as we divided by quantiles.

n = nrow(df_scaled)
ind = sample(n,  size = n * 0.8)
train = df_scaled[ind,]
test = df_scaled[-ind,]

correct_classes = test$crim
test %<>% dplyr::select(-crim)

5. LDA model

lda.fit <- lda(crim ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crim)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The LDA suggests that the “access to radial highways” rad is positively correlated with the highest crime-rate group.

6. Prediction

lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              18              7               0              0
##   (-0.411,-0.39]                4             18               5              0
##   (-0.39,0.00739]               0              7              17              1
##   (0.00739,9.92]                0              0               0             25

The confusion matrix shows that this model is pretty good at predicting the correct crime rate: the highest values are on the diagonal. It is also visible that the highest crime-rate class is the easiest to predict (blue colour on the LDA plot), as the LDA transformation separated that group best from the others. The three lower crime-rate groups are more mixed, hence more errors are made when predicting them.

7. K-means clustering

# We don't need to reload, just scale again
df_scaled = as_tibble(scale(as_tibble(Boston)))
dist_eu = dist(df_scaled)
# Note: kmeans() does not take this distance matrix as input (it computes
# Euclidean distances internally), but the assignment asks to calculate it.

km = kmeans(df_scaled, centers = 4)

pairs(df_scaled, col = km$cluster)

Not much is visible in these figures. Unfortunately I mostly tend to see two clusters along the black variable, suggesting segregation in this respect.

set.seed(123)

# Let's not overdo it, k-means is a fairly expensive technique
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(df_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(df_scaled, centers = 2)
p = ggpairs(df_scaled %>% mutate(cluster = as_factor(km$cluster)), mapping = aes(color = cluster, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
           upper = list(continuous = wrap("cor", size = 2)) )
p
## Warning in cor(x, y): the standard deviation is zero


Indeed, based on the dropping WCSS (within-cluster sum of squares, a measure of coherence within a cluster; though note that the choice of cluster number is by definition somewhat dubious!) two clusters is a good number for this data, and it reveals segregated societal groups (districts) with clear separation in crime rates and in the lower status of the population (lstat). It is interesting that the nitrogen oxide pollution is actually lower for the lower-status group. The lower-status group obviously pays less tax (likely connected to lower income) and lives in older houses. Most interesting is the distribution of the average number of rooms: the joint normal distribution is composed of two somewhat heavy-tailed distributions pointing in opposite directions, the lower status with fewer rooms and the upper status with more rooms.

I personally like PCA, so let's also visualize the clusters with that (even though this is not listed as an extra-point exercise).

 pco = prcomp(df_scaled)

pco$x %>% as_tibble() %>% mutate(cluster = as_factor(km$cluster)) %>% ggplot() +
  aes(x = PC1, y = PC2, color = cluster) +
  geom_point()

And indeed: PC1 picks up exactly the same difference that was achieved with the clustering, but it suggests that the separation is not binary in the full data (there are features that mix these societies together).


Dimensionality reduction

1. Data characteristics

  # load the data; row.names = 1 reads the country names back as row names
  df = read.csv('data/human2.csv', row.names = 1)

  # pairwise scatter plots, distributions and correlations (ggpairs is from GGally)
  ggpairs(df)

  summary(df)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

In this exercise we are working with countries and data related to their human development. The data consists of 155 observations of 8 variables. The variables are all continuous but have various distributions, e.g. the expected years of education (Edu.Exp) is roughly normal, whilst the Gross National Income (GNI) looks more like a power-law distribution. Some variable pairs are positively correlated (e.g. expected years of education and life expectancy) and some negatively (e.g. life expectancy and adolescent birth rate).
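The correlations mentioned above can be verified numerically with base R (assuming df as loaded above):

```r
# Full correlation matrix, rounded for readability
round(cor(df), 2)

# The specific pairs mentioned above:
cor(df$Edu.Exp, df$Life.exp)    # positive
cor(df$Life.exp, df$Ado.Birth)  # negative
```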

2. Principal component analysis

pca_human = prcomp(df)

biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), xlab = 'Gross national income')
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


# Explained variance proportion
sprintf('%.2f%% ',pca_human$sdev^2/sum(pca_human$sdev^2)*100)
## [1] "99.99% " "0.01% "  "0.00% "  "0.00% "  "0.00% "  "0.00% "  "0.00% " 
## [8] "0.00% "
# Can be seen also in
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

3. Standardized PCA

df_std = scale(df)
pca_human = prcomp(df_std)

biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), xlab = 'Human development', ylab = 'Gender equality')

# Explained variance proportion
sprintf('%.2f%% ',pca_human$sdev^2/sum(pca_human$sdev^2)*100)
## [1] "53.60% " "16.24% " "9.57% "  "7.58% "  "5.48% "  "3.60% "  "2.63% " 
## [8] "1.30% "
# Can be seen also in
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

The results are clearly different. In the non-standardized case the first principal component accounts for almost 100% of the variance. This component coincides with GNI: without normalization its values dominate the other variables. In the normalized case, however, the variables contribute much more equally to the principal components. Moreover, the plot seems to cluster countries meaningfully, e.g. the Nordic countries are grouped in the top left, and indeed their human development is rather similar. I added my interpretation to the axis labels, but note that the sign of a principal component is not well defined and the axes could equally well be mirrored. Specifically, even though I named PC1 "human development" (in the scaled case), human development actually decreases from left to right.
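The dominance of GNI in the non-standardized case can also be seen directly from the column variances (a quick check on df as loaded above):

```r
# Variances on the original scale: GNI is orders of magnitude larger
# than the rest, so unscaled PCA essentially reduces to the GNI axis.
sort(apply(df, 2, var), decreasing = TRUE)
```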

4. Interpretations

I have already given my interpretation briefly by naming the axes above.

  • PC1 seems to pick up differences in human development, such as the state of education, life expectancy or the severity of maternal mortality.
  • PC2 seems to differentiate countries based on gender equality (shown by the variables for women's labour-market share and parliamentary representation). Indeed, there are countries that are developed yet where women's rights do not meet liberal standards (e.g. Qatar; just think of the World Cup scandals).
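This interpretation can be checked against the loadings of the standardized PCA (pca_human as fitted above):

```r
# Loadings of the first two components; entries with large magnitude
# show which variables drive each axis.
round(pca_human$rotation[, 1:2], 2)
```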

5. Multiple correspondence analysis (Tea)

In this exercise we will work with another dataset on human participants’ tea consumption habits.

tea <- as_tibble(read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE))

str(tea)
## tibble [300 x 36] (S3: tbl_df/tbl/data.frame)
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int [1:300] 39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
tea_time = tea %>% dplyr::select(c("Tea", "How", "how", "sugar", "where", "lunch"))

#View(tea) # I'd rather not use this in the RMarkdown
# Just visualize the selected variables (pivot_longer is from tidyr)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") +
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

The data consists of 300 observations of 36 variables; for the MCA below I selected 6 of them into tea_time.

# MCA is from the FactoMineR package
mca = MCA(tea_time, graph = TRUE)

summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

If two categories appear close to each other on the variable representation plot, it means that those properties cluster / differentiate the participants in a similar way. This is expanded in the MCA factor map, where one can see that people who drink their tea in tea shops tend to drink unpackaged tea, which makes sense. If someone buys their tea from a chain store, on the other hand, it is more likely to come in tea bags.
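These category relationships are easiest to read from a plot of the categories alone, hiding the individuals (using FactoMineR's plot method for MCA objects):

```r
# Plot only the variable categories on the first two dimensions;
# categories that co-occur across participants appear close together.
plot(mca, invisible = c("ind"), habillage = "quali")
```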